Introduction

The purpose of this project

The purpose of this project is to build a model that predicts whether a person has heart disease based on some key indicators.

Some facts you need to know about heart disease

“According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity or drinking too much alcohol.” (from the Kaggle dataset page)

In order to give you a better understanding of heart disease, its harm and the key factors that may cause it, here are a few videos for you:

#install.packages('vembedr')
library(vembedr)
embed_url("https://www.youtube.com/watch?v=g131j2lb3xw")
embed_url("https://www.youtube.com/watch?v=u7k6sqTxOCU&t=139s")

Why might this model be useful?

Heart disease is a leading cause of death not only in the United States but worldwide, and the number of deaths from heart disease rises every year. Therefore, we need a model that predicts a person’s risk of heart disease from some important indicators, such as their BMI, whether they smoke, whether they drink alcohol and so on. By using this model, we can remind people to pay attention to their health and lifestyle habits to help them avoid heart disease.

An overview of my dataset

This project uses Kamil Pytlak’s dataset from Kaggle.

According to Kamil Pytlak, his dataset originally comes from the 2020 annual CDC survey of 400k adults related to their health status. He cleaned the original CDC survey data and selected the most relevant variables from it to make it easier to build machine learning projects related to heart disease.

This dataset contains 319795 observations of 18 variables (9 booleans, 5 strings and 4 decimals). HeartDisease is the response variable; the other 17 variables are our predictors. A full copy of the codebook of the unprocessed dataset is available in my zipped files, but to help us better understand these variables, I also show some important parts of the codebook here.

  • HeartDisease: Whether the respondent has ever been diagnosed with heart disease

  • BMI: The body mass index of the respondent

  • Smoking: Whether the respondent has smoked at least 100 cigarettes in his/her entire life. [ Note: 5 packs = 100 cigarettes ]

  • AlcoholDrinking: Whether the respondent is a heavy drinker [ Note: a heavy drinker is an adult man who has more than 14 drinks per week or an adult woman who has more than 7 drinks per week ]

  • Stroke: Whether the respondent had a stroke

  • PhysicalHealth: The number of days that the respondent had poor physical health in the past 30 days [ Note: It includes physical illness and injury ]

  • MentalHealth: The number of days that the respondent had poor mental health in the past 30 days

  • DiffWalking: Whether the respondent has serious difficulty walking or climbing stairs

  • Sex: Whether the respondent is male or female

  • AgeCategory: What age group the respondent is in [ Note: the answer should be ‘18-24’, ‘25-29’, ‘30-34’, ‘35-39’, ‘40-44’, ‘45-49’, ‘50-54’, ‘55-59’, ‘60-64’, ‘65-69’, ‘70-74’, ‘75-79’, ‘80 or older’ ]

  • Race: The race of the respondent [Note: the answer should be ‘American Indian/Alaskan Native’, ‘Asian’, ‘Black’, ‘Hispanic’, ‘White’, ‘Other’]

  • Diabetic: Whether the respondent has diabetes [ Note: the answer should be ‘No’, ‘No, borderline diabetes’, ‘Yes’, ‘Yes (during pregnancy)’ ]

  • PhysicalActivity: Whether the respondent did physical activity or exercise during the past 30 days other than their regular job

  • GenHealth: The respondent’s self-assessment of his/her general health [ Note: the answer should be ‘Excellent’, ‘Very good’, ‘Good’, ‘Fair’, ‘Poor’ ]

  • SleepTime: The number of hours of sleep of the respondent in a 24-hour period

  • Asthma: Whether the respondent had asthma

  • KidneyDisease: Whether the respondent had kidney disease [ Note: not including kidney stones, bladder infection or incontinence ]

  • SkinCancer: Whether the respondent had skin cancer

Note: a full copy of the codebook is available in my zipped files.
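
For reference, BMI is weight in kilograms divided by height in meters squared; a one-line R illustration with hypothetical weight/height values:

```r
# BMI = weight (kg) / height (m)^2  -- hypothetical values, for illustration
bmi <- function(weight_kg, height_m) weight_kg / height_m^2
bmi(70, 1.75)   # about 22.9
```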

Loading Data and Packages

# install.packages('caret')
# install.packages("ROSE")
# load packages
library(tidyverse)
library(tidymodels)
library(ISLR)
library(ISLR2)
library(discrim)
library(corrr)
library(rpart.plot)
library(vip)
library(janitor)
library(randomForest)
library(xgboost)
library(corrplot)
library(glmnet)
library(ranger)
library(caret)
library(klaR)
library(dplyr)
tidymodels_prefer()
library(ROSE)
library(kknn)
set.seed(1234)
# read the dataset
records<- read.csv("heart_2020_cleaned.csv")
head(records) # show the first few rows of the dataset
##   HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 1           No 16.60     Yes              No     No              3           30
## 2           No 20.34      No              No    Yes              0            0
## 3           No 26.58     Yes              No     No             20           30
## 4           No 24.21      No              No     No              0            0
## 5           No 23.71      No              No     No             28            0
## 6          Yes 28.87     Yes              No     No              6            0
##   DiffWalking    Sex AgeCategory  Race Diabetic PhysicalActivity GenHealth
## 1          No Female       55-59 White      Yes              Yes Very good
## 2          No Female 80 or older White       No              Yes Very good
## 3          No   Male       65-69 White      Yes              Yes      Fair
## 4          No Female       75-79 White       No               No      Good
## 5         Yes Female       40-44 White       No              Yes Very good
## 6         Yes Female       75-79 Black       No               No      Fair
##   SleepTime Asthma KidneyDisease SkinCancer
## 1         5    Yes            No        Yes
## 2         7     No            No         No
## 3         8    Yes            No         No
## 4         6     No            No        Yes
## 5         8     No            No         No
## 6        12     No            No         No

Data Cleaning

While the downloaded data set was tidy, a few cleaning steps were necessary before the split occurred:

  • Clean names
records <-records %>% clean_names()
head(records)
##   heart_disease   bmi smoking alcohol_drinking stroke physical_health
## 1            No 16.60     Yes               No     No               3
## 2            No 20.34      No               No    Yes               0
## 3            No 26.58     Yes               No     No              20
## 4            No 24.21      No               No     No               0
## 5            No 23.71      No               No     No              28
## 6           Yes 28.87     Yes               No     No               6
##   mental_health diff_walking    sex age_category  race diabetic
## 1            30           No Female        55-59 White      Yes
## 2             0           No Female  80 or older White       No
## 3            30           No   Male        65-69 White      Yes
## 4             0           No Female        75-79 White       No
## 5             0          Yes Female        40-44 White       No
## 6             0          Yes Female        75-79 Black       No
##   physical_activity gen_health sleep_time asthma kidney_disease skin_cancer
## 1               Yes  Very good          5    Yes             No         Yes
## 2               Yes  Very good          7     No             No          No
## 3               Yes       Fair          8    Yes             No          No
## 4                No       Good          6     No             No         Yes
## 5               Yes  Very good          8     No             No          No
## 6                No       Fair         12     No             No          No
  • Deal with the class imbalance. Now, let’s check whether our response variable is balanced. If not, we need to deal with it.
table(records$heart_disease)
## 
##     No    Yes 
## 292422  27373

We find that our response variable is highly imbalanced: there are many more observations at the ‘No’ level than the ‘Yes’ level. We need to deal with this problem, otherwise it will have a serious impact on our predictions. According to the TA, we can use the ovun.sample() function from the ROSE package to undersample the majority class.
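
ovun.sample() with method = "under" keeps all minority-class rows and randomly drops majority-class rows until the classes are roughly balanced. The idea can be sketched in base R on toy data (this is a sketch of the concept, not the ROSE implementation):

```r
set.seed(1)
# toy imbalanced outcome: 90 "No" vs 10 "Yes"
df <- data.frame(id = 1:100,
                 outcome = c(rep("No", 90), rep("Yes", 10)))

minority <- df[df$outcome == "Yes", ]
majority <- df[df$outcome == "No", ]

# keep every minority row; sample an equal number of majority rows
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])
table(under$outcome)
##  No Yes 
##  10  10
```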

set.seed(1234)
newrecords<- ovun.sample(heart_disease~.,data = records,
                             p=0.5,seed = 1,method = "under")$data
  • Check whether our response variable is now balanced.
table(newrecords$heart_disease)
## 
##    No   Yes 
## 27387 27373

Although there are slightly more ‘No’ observations, our response variable is now almost balanced.

  • Check if there is a missing value. If yes, we need to remove the observation with the missing value. If no, we continue the process of data cleaning.
sum(is.na(newrecords)) 
## [1] 0
# It means there is no missing value in our data. 
# We continue the process of data cleaning.
  • Convert heart_disease, smoking, alcohol_drinking, stroke, diff_walking, sex, age_category, race, diabetic, physical_activity, gen_health, asthma, kidney_disease, skin_cancer to factors
newrecords <- newrecords %>% 
  mutate(
    heart_disease  = factor(heart_disease, levels = c('Yes', 'No')),
    smoking  = factor(smoking, levels = c('Yes', 'No')),
    alcohol_drinking  = factor(alcohol_drinking, levels = c('Yes', 'No')),
    stroke  = factor(stroke, levels = c('Yes', 'No')),
    diff_walking  = factor(diff_walking, levels = c('Yes', 'No')),
    sex  = factor(sex),
    age_category  = factor(age_category),
    race  = factor(race, levels = c("American Indian/Alaskan Native", "Asian", "Black", "Hispanic", "White", "Other")),
    diabetic  = factor(diabetic, levels = c("No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)")),
    physical_activity  = factor(physical_activity),
    gen_health  = factor(gen_health, levels = c("Excellent","Very good", "Good", "Fair", "Poor" )),
    asthma  = factor(asthma, levels = c('Yes', 'No')),
    kidney_disease  = factor(kidney_disease, levels = c('Yes', 'No')),
    skin_cancer  = factor(skin_cancer, levels = c('Yes', 'No'))
  )
head(newrecords) 
##   heart_disease   bmi smoking alcohol_drinking stroke physical_health
## 1            No 33.84      No               No     No               0
## 2            No 31.75      No              Yes     No               0
## 3            No 33.64      No               No     No               0
## 4            No 24.56      No               No     No               0
## 5            No 40.69     Yes               No     No              30
## 6            No 27.89      No               No     No               0
##   mental_health diff_walking    sex age_category  race diabetic
## 1             2           No Female        45-49 White       No
## 2             0           No   Male        55-59 White       No
## 3            28           No   Male        40-44 Black      Yes
## 4             0           No Female        40-44 Asian       No
## 5             0          Yes   Male        60-64 White      Yes
## 6             0           No   Male        75-79 White       No
##   physical_activity gen_health sleep_time asthma kidney_disease skin_cancer
## 1               Yes  Very good          6     No             No         Yes
## 2               Yes  Excellent          7     No             No         Yes
## 3               Yes       Good          7     No             No          No
## 4               Yes  Excellent          6     No             No          No
## 5               Yes       Fair          7     No             No          No
## 6               Yes  Very good          5     No             No         Yes
# show how many observations and variables the new dataset has
dim(newrecords) 
## [1] 54760    18
  • We completed the process of data cleaning.

Exploratory Data Analysis

This exploratory data analysis is based on the entire undersampled set, which has 54760 observations of 18 variables. Each observation represents a single survey respondent.

Variable heart_disease

  • During the process of exploratory data analysis, we first analyze our response variable heart_disease.
newrecords %>% 
  ggplot(aes(x = heart_disease)) +
  geom_bar() 

  • According to the graph, we find that our response variable is almost balanced. Although there are slightly more observations at the ‘No’ level than the ‘Yes’ level, this will not have a significant impact on our predictions.

Variable bmi

Now, let’s analyze variable bmi.

  • First, draw a histogram of variable bmi.
hist(newrecords$bmi, main = paste("Histogram of BMI"), xlab = 'The value of BMI', ylab = 'The number of respondents')

* The distribution of bmi is right skewed, with a long right tail, though it otherwise looks roughly normal. There’s one peak around 25-30. Most people have a BMI below 40.
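
A long right tail corresponds to positive (right) skew, which we can check numerically. A base-R sketch of sample skewness, illustrated on simulated right-skewed data (the same function could be applied to newrecords$bmi):

```r
# sample skewness: mean of the standardized third moment
skewness <- function(v) mean(((v - mean(v)) / sd(v))^3)

set.seed(1)
x <- rexp(10000)   # a right-skewed distribution (long right tail)
skewness(x) > 0    # TRUE: positive skewness means right skew
```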

  • Second, draw a boxplot of variable bmi by heart_disease
newrecords %>%
  ggplot(aes(x = bmi, y=heart_disease))+
  geom_boxplot() +
  xlab("The BMI of the respondent") 

* Based on the graph, we can see that respondents with a higher BMI are more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable smoking

  • First, draw a plot of variable smoking.
newrecords %>% 
  ggplot(aes(x = smoking)) +
  geom_bar() 

* According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable smoking.

  • Second, draw a plot of variable smoking by heart_disease
newrecords %>%
  ggplot(aes(x= smoking, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can see that respondents who smoke (have smoked at least 100 cigarettes in their entire life) are more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable alcohol_drinking

  • First, draw a plot of variable alcohol_drinking.
newrecords %>% 
  ggplot(aes(x = alcohol_drinking)) +
  geom_bar() 

  • According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable alcohol_drinking. For this reason, it may be hard for us to find the relationship between alcohol_drinking and heart_disease.

  • Second, draw a plot of variable alcohol_drinking by heart_disease

newrecords %>%
  ggplot(aes(x= alcohol_drinking, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can see that respondents who are not heavy drinkers are more likely to get heart disease. [ Note: a heavy drinker is an adult man who has more than 14 drinks per week or an adult woman who has more than 7 drinks per week ]
  • Notice that this result may not be reliable, since there are many more observations at the ‘No’ level than the ‘Yes’ level for variable alcohol_drinking. ( We will confirm this result at the end of this project )

Variable Stroke

  • First, draw a plot of variable stroke.
newrecords %>% 
  ggplot(aes(x = stroke)) +
  geom_bar() 

  • According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable stroke. For this reason, it may be hard for us to find the relationship between stroke and heart_disease.

  • Second, draw a plot of variable stroke by heart_disease

newrecords %>%
  ggplot(aes(x= stroke, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can see that respondents who had a stroke are more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable physical_health

  • First, draw a histogram of variable physical_health: The number of days that the respondent had poor physical health in the past 30 days [ Note: It includes physical illness and injury ]
hist(newrecords$physical_health, main = paste("Histogram of physical_health"), xlab = 'The number of days that the respondent had poor physical health in the past 30 days', ylab = 'The number of respondents')

  • According to the graph, we find that most respondents had fewer than 10 days of poor physical health in the past 30 days, while some respondents had poor physical health on all 30 days.

  • Second, draw a boxplot of variable physical_health by heart_disease

newrecords %>%
  ggplot(aes(x =physical_health , y=heart_disease))+
  geom_boxplot() +
  xlab("The number of days that the respondent had poor physical health in the past 30 days") 

  • Based on the graph, we can find that the number of days that the respondents had poor physical health in the past 30 days may affect whether they have a heart disease. ( We will confirm this result at the end of this project )

Variable mental_health

  • First, draw a histogram of variable mental_health: The number of days that the respondent had poor mental health in the past 30 days
hist(newrecords$mental_health, main = paste("Histogram of mental_health"), xlab = 'The number of days that the respondent had poor mental health in the past 30 days', ylab = 'The number of respondents')

  • According to the graph, we find that most respondents had fewer than 10 days of poor mental health in the past 30 days, while some respondents had poor mental health on all 30 days.

  • Second, draw a boxplot of variable mental_health by heart_disease

newrecords %>%
  ggplot(aes(x =mental_health , y=heart_disease))+
  geom_boxplot() +
  xlab("The number of days that the respondent had poor mental health in the past 30 days") 

  • Based on the graph, we can find that the number of days that the respondents had poor mental health in the past 30 days may affect whether they have a heart disease. ( We will confirm this result at the end of this project )

Variable diff_walking

  • First, draw a plot of variable diff_walking.
newrecords %>% 
  ggplot(aes(x = diff_walking)) +
  geom_bar() 

  • According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable diff_walking. For this reason, it may be hard for us to find the relationship between diff_walking and heart_disease.

  • Second, draw a plot of variable diff_walking by heart_disease

newrecords %>%
  ggplot(aes(x= diff_walking, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can find that the respondent who has serious difficulty walking or climbing stairs is more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable sex

  • First, draw a plot of variable sex.
newrecords %>% 
  ggplot(aes(x = sex)) +
  geom_bar() 

  • According to the graph, we find that there are more observations at the ‘Male’ level than the ‘Female’ level for variable sex. Since the difference between these two levels is not very big, it won’t have a bad influence.

  • Second, draw a plot of variable sex by heart_disease

newrecords %>%
  ggplot(aes(x= sex, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can find that the respondent who is male is more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable age_category

  • First, draw a plot of variable age_category.
newrecords %>% 
  ggplot(aes(x = age_category)) +
  geom_bar() 

  • According to the graph, we find that there are more older respondents (older than 50) than younger respondents (younger than 50).

  • Second, draw a plot of variable age_category by heart_disease

newrecords %>%
  ggplot(aes(x= age_category, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can see that older people (older than 50) are more likely to get heart disease than younger people (younger than 50). ( We will confirm this result at the end of this project )

Variable race

  • First, draw a plot of variable race.
newrecords %>% 
  ggplot(aes(x = race)) +
  geom_bar() 

  • According to the graph, we find that most of the respondents are white. For this reason, it may be hard for us to find the relationship between race and heart_disease.

  • Second, draw a plot of variable race by heart_disease

newrecords %>%
  ggplot(aes(x= race, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal() 

  • Based on the graph, although it is difficult to tell which race is most likely to get heart disease, we can see that the probability of heart disease differs across races. ( We will confirm this result at the end of this project )

Variable diabetic

  • First, draw a plot of variable diabetic.
newrecords %>% 
  ggplot(aes(x = diabetic)) +
  geom_bar() 

  • According to the graph, we find that most of the respondents are at the ‘No’ level, and some of the respondents are at the ‘Yes’ level. For this reason, it may be hard for us to find the relationship between diabetic and heart_disease.

  • Second, draw a plot of variable diabetic by heart_disease

newrecords %>%
  ggplot(aes(x= diabetic, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal() 

  • Based on the graph, we can see that respondents who have diabetes are more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable physical_activity

  • First, draw a plot of variable physical_activity. physical_activity: Whether the respondent did physical activity or exercise during the past 30 days other than their regular job.
newrecords %>% 
  ggplot(aes(x = physical_activity)) +
  geom_bar() 

  • According to the graph, we find that most of the respondents are at the ‘Yes’ level, and some of the respondents are at the ‘No’ level. For this reason, it may be hard for us to find the relationship between physical_activity and heart_disease.

  • Second, draw a plot of variable physical_activity by heart_disease

newrecords %>%
  ggplot(aes(x= physical_activity, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal() 

  • Based on the graph, we can find that the respondent who didn’t do physical activity or exercise during the past 30 days other than their regular job is more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable gen_health

  • First, draw a plot of variable gen_health. gen_health: The respondent’s health assessment of his/her self in general [ Notes : the answer should be ‘Very good’, ‘Good’, ‘Excellent’, ‘Fair’, ‘Poor’ ]
newrecords %>% 
  ggplot(aes(x = gen_health)) +
  geom_bar() 

  • According to the plot, we can see that most respondents think they are in good or better health, while some respondents think they are in fair or poor health.

  • Second, draw a plot of variable gen_health by heart_disease

newrecords %>%
  ggplot(aes(x= gen_health, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal() 

  • Based on the graph, we can see that respondents who think they are in poor health are more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable sleep_time

  • First, draw a histogram of variable sleep_time
hist(newrecords$sleep_time, main = paste("Histogram of sleep time"), xlab = 'The number of hours of sleep of the respondent in a 24-hour period', ylab = 'The number of respondents')

  • According to the graph, the distribution of sleep_time is right skewed, with a long right tail, though it otherwise looks roughly normal. There’s one peak around 7-8 hours. Most people sleep between 5 and 8 hours.

  • Second, draw a boxplot of variable sleep_time by heart_disease

newrecords %>%
  ggplot(aes(x = sleep_time, y=heart_disease))+
  geom_boxplot() +
  xlab("The number of hours of sleep of the respondent in a 24-hour period") 

  • From the graph, since the boxplots of the two levels (‘Yes’, ‘No’) are very similar and the medians are very close to each other, it is hard to tell whether the length of sleep time is related to heart disease. We will explore this more in the modeling part.
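
When boxplots look this similar, comparing the group medians numerically can help. A base-R pattern using tapply() (toy values shown; with the real data you would pass newrecords$sleep_time and newrecords$heart_disease):

```r
# toy stand-ins for sleep_time and heart_disease
sleep   <- c(7, 6, 8, 5, 7, 6, 7, 8)
disease <- c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No")
tapply(sleep, disease, median)   # median sleep time per group
##  No Yes 
##   7   6
```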

Variable asthma

  • First, draw a plot of variable asthma
newrecords %>% 
  ggplot(aes(x = asthma)) +
  geom_bar() 

  • According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable asthma. For this reason, it may be hard for us to find the relationship between asthma and heart_disease.

  • Second, draw a plot of variable asthma by heart_disease

newrecords %>%
  ggplot(aes(x= asthma, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

* Based on the graph, we can find that the respondent who had asthma is more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable kidney_disease

  • First, draw a plot of variable kidney_disease
newrecords %>% 
  ggplot(aes(x = kidney_disease)) +
  geom_bar() 

  • According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable kidney_disease. For this reason, it may be hard for us to find the relationship between kidney_disease and heart_disease.

  • Second, draw a plot of variable kidney_disease by heart_disease

newrecords %>%
  ggplot(aes(x= kidney_disease, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

* Based on the graph, we can find that the respondent who had kidney_disease is more likely to get heart disease. ( We will confirm this result at the end of this project )

Variable skin_cancer

  • First, draw a plot of variable skin_cancer
newrecords %>% 
  ggplot(aes(x = skin_cancer)) +
  geom_bar() 

  • According to the graph, we find that there are many more observations at the ‘No’ level than the ‘Yes’ level for variable skin_cancer. For this reason, it may be hard for us to find the relationship between skin_cancer and heart_disease.

  • Second, draw a plot of variable skin_cancer by heart_disease

newrecords %>%
  ggplot(aes(x= skin_cancer, y= heart_disease , fill= heart_disease)) +
  geom_bar(stat="identity")+theme_minimal()

  • Based on the graph, we can find that the respondent who had skin_cancer is more likely to get heart disease. ( We will confirm this result at the end of this project )

We completed the process of exploratory data analysis.

Data Split

The data was split 80% training, 20% testing. Stratified sampling on heart_disease was used so that both splits keep the same class proportions (see the EDA for the class distribution).

The testing set is set aside at the split and is not examined again until the final model is evaluated on those observations.

set.seed(1234)
newrecords_split <- newrecords %>% 
  initial_split(prop = 0.8, strata = "heart_disease")

newrecords_train <- training(newrecords_split)
newrecords_test <- testing(newrecords_split)
# check whether the training and testing data sets 
# have the appropriate number of observations. 
dim(newrecords)
## [1] 54760    18
dim(newrecords_train)
## [1] 43807    18
dim(newrecords_test) 
## [1] 10953    18
  • The training data set has 43807 observations, just under \(54760 \times 0.80 = 43808\), and the testing data set has 10953 observations, just over \(54760 \times 0.20 = 10952\). So, according to the code output and our calculations, we can conclude that our training and testing data sets have the appropriate number of observations.
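
initial_split() with strata = "heart_disease" draws the 80/20 split within each outcome level, so both sets keep the class proportions. The idea can be sketched in base R (toy balanced outcome; a sketch of the concept, not the rsample internals):

```r
set.seed(1234)
y <- factor(c(rep("No", 50), rep("Yes", 50)))   # toy balanced outcome

# sample 80% of the row indices within each level of y
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, floor(0.8 * length(i)))))

length(train_idx)    # 80 training rows
table(y[train_idx])  # 40 "No" and 40 "Yes": class proportions preserved
```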

Model Fitting

In this part, I decided to use 4 different model classes (6 models in total).

  • Class 1: Logistic regression, LDA and QDA

  • Class 2: Boosted tree

  • Class 3: Random forest

  • Class 4: Nearest Neighbors

Building the Recipe

  • Using the training data, create a recipe predicting the outcome variable heart_disease. Include the following predictors: bmi, smoking, alcohol_drinking, stroke, physical_health, mental_health, diff_walking, sex, age_category, race, diabetic, physical_activity, gen_health, sleep_time, asthma, kidney_disease, skin_cancer.
  • Dummy-code smoking, alcohol_drinking, stroke, diff_walking, sex, age_category, race, diabetic, physical_activity, gen_health, asthma, kidney_disease, skin_cancer. (According to the lecture, we should encode all nominal predictors.)
  • Center and scale all predictors.
recipe <- recipe(
  heart_disease ~ bmi + smoking + alcohol_drinking + stroke + physical_health + mental_health + diff_walking + sex + age_category + race + diabetic + physical_activity + gen_health + sleep_time + asthma + kidney_disease + skin_cancer,
  data = newrecords_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
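
step_center() and step_scale() together standardize each predictor to mean 0 and standard deviation 1. A base-R sketch of what the two steps compute, using a few bmi values from head(records):

```r
x <- c(16.60, 20.34, 26.58, 24.21, 23.71)  # a few bmi values from head(records)
z <- (x - mean(x)) / sd(x)                 # center, then scale

round(mean(z), 10)  # 0: centered
round(sd(z), 10)    # 1: scaled
```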

Use cross-validation to fold the training set.

  • We use v-fold cross-validation on the training set, with 5 folds.
  • We also stratify the folds by heart_disease.
  • Stratifying the folds is useful because it makes sure the distribution of heart_disease (the outcome) remains the same across the folds.
set.seed(1234)
newrecords_folds <- vfold_cv(data = newrecords_train, v = 5, strata = heart_disease )
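
Stratified fold assignment can be sketched in base R (toy outcome vector; a sketch of the concept, not the vfold_cv() internals):

```r
set.seed(1234)
y <- factor(rep(c("No", "Yes"), each = 25))   # toy outcome

# assign fold ids 1..5 within each outcome level (what strata ensures)
folds <- integer(length(y))
for (lev in levels(y)) {
  idx <- which(y == lev)
  folds[idx] <- sample(rep(1:5, length.out = length(idx)))
}
table(folds, y)   # every fold holds 5 "No" and 5 "Yes" observations
```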

Class 1: Logistic regression, LDA and QDA

Since these three models belong to the same model class, we select the best of the three to represent this class. We will then compare the best model from each of the 4 model classes to choose our final model.

Logistic regression

We specify a logistic regression model for classification using the “glm” engine. Then we create a workflow and add the model and the recipe we created before.

log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

log_wkflow <- workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(recipe)
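
Under the hood, the “glm” engine calls stats::glm() with family = binomial, which models the log-odds of the outcome as a linear function of the predictors. A minimal self-contained illustration on simulated data (the true coefficients -1 and 2 are assumptions of the simulation, not values from our dataset):

```r
set.seed(1234)
n <- 500
x <- rnorm(n)
p <- 1 / (1 + exp(-(-1 + 2 * x)))  # true log-odds: -1 + 2x (simulation assumption)
y <- rbinom(n, 1, p)

fit <- glm(y ~ x, family = binomial)
coef(fit)  # estimates should be roughly -1 and 2
```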

LDA

We follow a similar process, but this time we specify a linear discriminant analysis model for classification using the “MASS” engine.

lda_mod <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS")

lda_wkflow <- workflow() %>% 
  add_model(lda_mod) %>% 
  add_recipe(recipe)

QDA

We follow a similar process, but this time specify a quadratic discriminant analysis model for classification using the “MASS” engine.

qda_mod <- discrim_quad() %>% 
  set_mode("classification") %>% 
  set_engine("MASS")

qda_wkflow <- workflow() %>% 
  add_model(qda_mod) %>% 
  add_recipe(recipe)

Assess the performance of each of these three models

  • Fit each of the models created before to the folded data.
control <- control_resamples(save_pred = TRUE)

log_fit <- fit_resamples(log_wkflow, newrecords_folds,control = control)

lda_fit <- fit_resamples(lda_wkflow,
                         resamples = newrecords_folds,
                         control = control)

qda_fit <- fit_resamples(qda_wkflow, resamples = newrecords_folds,
                         control = control)
  • We will use collect_metrics() to print the mean and standard error of the performance metric accuracy across all folds for each of the three models.

  • We will then decide which of the 3 fitted models performed best.

collect_metrics(log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.627     5 0.00166 Preprocessor1_Model1
## 2 roc_auc  binary     0.651     5 0.00191 Preprocessor1_Model1
collect_metrics(lda_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.627     5 0.00166 Preprocessor1_Model1
## 2 roc_auc  binary     0.651     5 0.00191 Preprocessor1_Model1
collect_metrics(qda_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n  std_err .config             
##   <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
## 1 accuracy binary     0.567     5 0.000701 Preprocessor1_Model1
## 2 roc_auc  binary     0.650     5 0.000842 Preprocessor1_Model1

Based on these results, the three models have very similar standard errors for accuracy (the differences between them are very small). The mean accuracy of logistic regression equals that of LDA, and both are higher than the mean accuracy of QDA, so logistic regression and LDA outperform QDA.

Logistic regression and LDA also have the same mean and standard error for roc_auc, which means they perform essentially identically here. We will simply choose logistic regression as the best of these three models.
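Rather than reading three separate printouts, the comparison can also be made in one table; this is a small sketch that stacks the cross-validated metrics with a model label:

```r
# Sketch: combine the three models' resampling metrics into one table
bind_rows(
  collect_metrics(log_fit) %>% mutate(model = "logistic regression"),
  collect_metrics(lda_fit) %>% mutate(model = "LDA"),
  collect_metrics(qda_fit) %>% mutate(model = "QDA")
) %>% 
  arrange(.metric, desc(mean))
```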

  • Now that we’ve chosen a model, fit our chosen model to the entire training dataset (not to the folds).
log_fit_train <- fit(log_wkflow, newrecords_train)
  • Finally, with our fitted model, use predict(), bind_cols(), and accuracy() to assess its performance on the testing data!
log_test <- log_fit_train  # use the model fitted on the training set; never refit on the test data
predict(log_test, new_data = newrecords_test, type = "class") %>% 
  bind_cols(newrecords_test %>% select(heart_disease)) %>% 
  accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.622

Confusion matrix, ROC curve and AUC values

  • Now, using the testing data, we want to create a confusion matrix and visualize it. Plot an ROC curve and calculate the area under it (AUC).
#  create a confusion matrix and visualize it
augment(log_test, new_data = newrecords_test) %>%
  conf_mat(truth = heart_disease, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

# Plot roc_curve
augment(log_test, new_data = newrecords_test) %>%
  roc_curve(heart_disease, .pred_Yes) %>%
  autoplot()

# Calculate AUC
augment(log_test, new_data = newrecords_test) %>%
  roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.646
  • Based on the results, we find that although logistic regression is the best of these 3 models, it is still not good enough. We want to see if a model from another model class works better than logistic regression.

Class 2: Random Forest

  • Now, we are going to set up a random forest model and workflow. Use the ranger engine and set importance = “impurity”. Tune mtry, trees, and min_n, following the documentation for rand_forest().
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>% 
  add_recipe(recipe)
  • Since we have 17 predictors, by the lecture we know the range for mtry should be no smaller than 1 and no larger than 17.

  • Since our dataset is very large, to keep the run time reasonable we set levels = 2 and cap the number of trees at 1000.

# set-up grid 
param_grid_rf <- grid_regular(mtry(range = c(1, 17)),
                           trees(range = c(10, 1000)), 
                           min_n(range = c(1, 10)),
                           levels = 2)
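With levels = 2 for each of the three tuning parameters, the regular grid contains 2 × 2 × 2 = 8 candidate combinations, which matches the 8 rows in the tuning results below:

```r
# Each parameter contributes 2 levels, so the grid has 2^3 = 8 rows
nrow(param_grid_rf)  # 8
```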
  • Tune the model and print an autoplot() of the results.
tune_res <- tune_grid(
  rf_wf, 
  resamples = newrecords_folds, 
  grid = param_grid_rf, 
  metrics = metric_set(roc_auc)
)
# print result
autoplot(tune_res)

  • Use collect_metrics() and arrange() to find the roc_auc of the best-performing random forest model on the folds.
arrange(collect_metrics(tune_res), desc(mean))
## # A tibble: 8 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config             
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1     1  1000    10 roc_auc binary     0.650     5 0.00175 Preprocessor1_Model7
## 2     1  1000     1 roc_auc binary     0.647     5 0.00474 Preprocessor1_Model3
## 3     1    10    10 roc_auc binary     0.641     5 0.00548 Preprocessor1_Model5
## 4     1    10     1 roc_auc binary     0.641     5 0.00345 Preprocessor1_Model1
## 5    17    10     1 roc_auc binary     0.627     5 0.00165 Preprocessor1_Model2
## 6    17    10    10 roc_auc binary     0.627     5 0.00165 Preprocessor1_Model6
## 7    17  1000     1 roc_auc binary     0.627     5 0.00167 Preprocessor1_Model4
## 8    17  1000    10 roc_auc binary     0.627     5 0.00167 Preprocessor1_Model8
  • We find that the mean roc_auc of the best-performing random forest model on the folds is 0.650, with mtry = 1, trees = 1000 and min_n = 10.

  • Use select_best() to choose the model that has the optimal roc_auc. Then use finalize_workflow(), fit(), and augment() to fit the model to the training set and evaluate its performance on the testing set.

best_complexity<- select_best(tune_res, metric= 'roc_auc')

class_forest_final <- finalize_workflow(rf_wf, best_complexity)

class_forest_final_fit <- fit(class_forest_final, data = newrecords_train)

augment(class_forest_final_fit, new_data = newrecords_test) %>%
  accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.622
  • We want to create and visualize a confusion matrix heat map.
augment(class_forest_final_fit, new_data = newrecords_test) %>%
  conf_mat(truth = heart_disease , estimate = .pred_class) %>%
  autoplot(type = "heatmap")

  • Print the ROC curves
# Plot roc_curve
augment(class_forest_final_fit, new_data = newrecords_test) %>%
  roc_curve(heart_disease, .pred_Yes) %>%
  autoplot()

  • Calculate the AUC value of your best-performing model on the testing set
# Calculate AUC
augment(class_forest_final_fit, new_data = newrecords_test) %>%
  roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.645
  • Finally, we see that the AUC value of our best-performing random forest model on the testing set is 0.645.

Class 3 : Boosted tree

  • Now, we are going to set up a boosted tree model and workflow using the xgboost engine. We will tune mtry, trees, and min_n, following the documentation for boost_tree().
boost_tree_spec <- boost_tree(trees = tune(),
                              min_n = tune(),
                              mtry = tune()
                              ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

boost_tree_workflow <- workflow() %>% 
  add_model(boost_tree_spec) %>% 
  add_recipe(recipe)
  • Since we have 17 predictors, by the lecture we know the range for mtry should be no smaller than 1 and no larger than 17.

  • Since our dataset is very large, to keep the run time reasonable we set levels = 2 and cap the number of trees at 2000.

# set-up grid 
boost_tree_grid<- grid_regular(mtry(range = c(1, 17)),
                           trees(range = c(10, 2000)), 
                           min_n(range = c(1, 10)),
                           levels = 2)
  • Tune the model and print an autoplot() of the results.
boost_tune_res <- tune_grid(
  boost_tree_workflow, 
  resamples = newrecords_folds, 
  grid = boost_tree_grid, 
  metrics = metric_set(roc_auc)
)
autoplot(boost_tune_res)

  • Use collect_metrics() and arrange() to find the roc_auc of the best-performing boosted tree model on the folds.
arrange(collect_metrics(boost_tune_res), desc(mean))
## # A tibble: 8 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config             
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1     1    10    10 roc_auc binary     0.650     5 0.00138 Preprocessor1_Model3
## 2     1    10     1 roc_auc binary     0.650     5 0.00125 Preprocessor1_Model1
## 3     1  2000    10 roc_auc binary     0.650     5 0.00127 Preprocessor1_Model4
## 4     1  2000     1 roc_auc binary     0.650     5 0.00127 Preprocessor1_Model2
## 5    17    10    10 roc_auc binary     0.650     5 0.00127 Preprocessor1_Model7
## 6    17    10     1 roc_auc binary     0.650     5 0.00127 Preprocessor1_Model5
## 7    17  2000     1 roc_auc binary     0.650     5 0.00127 Preprocessor1_Model6
## 8    17  2000    10 roc_auc binary     0.650     5 0.00127 Preprocessor1_Model8
  • We find that the mean roc_auc of the best-performing boosted tree models on the folds is 0.650. The tuned settings make almost no difference here; select_best() picks mtry = 1, trees = 10 and min_n = 10.

  • Use select_best() to choose the model that has the optimal roc_auc. Then use finalize_workflow(), fit(), and augment() to fit the model to the training set and evaluate its performance on the testing set.

best_boost_complexity<- select_best(boost_tune_res, metric= 'roc_auc')


boost_tree_final <- finalize_workflow( boost_tree_workflow, best_boost_complexity)

boost_tree_final_fit  <- fit(boost_tree_final, data = newrecords_train)

# evaluate its performance on the testing set (estimate the accuracy)
augment(boost_tree_final_fit , new_data = newrecords_test) %>%
  accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.622
  • We want to create and visualize a confusion matrix heat map.
augment(boost_tree_final_fit, new_data = newrecords_test) %>%
  conf_mat(truth = heart_disease , estimate = .pred_class) %>%
  autoplot(type = "heatmap")

  • Print the ROC curves
# Plot roc_curve
augment(boost_tree_final_fit, new_data = newrecords_test) %>%
  roc_curve(heart_disease, .pred_Yes) %>%
  autoplot()

  • Calculate the AUC value of your best-performing model on the testing set
# Calculate AUC
augment(boost_tree_final_fit , new_data = newrecords_test) %>%
  roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.643
  • Finally, we see that the AUC value of our best-performing boosted tree model on the testing set is 0.643.

Class 4: K Nearest Neighbors

Lastly, we run v-fold cross-validation on the K nearest neighbors model in the same fashion as the previous models. For nearest neighbors, we tune only neighbors, since the model’s other defaults are fine, as mentioned in lab 07. We also set up the workflow and add the recipe.

knn_model <- nearest_neighbor(neighbors = tune(),mode = "classification") %>%
                              set_engine("kknn")
  

knn_workflow <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(recipe)
  • We set up the tuning parameters and define the grid.
# set-up tuning grid 
knn_params <- parameters(knn_model)

# define grid
knn_grid <- grid_regular(knn_params, levels = 2)
  • Tune the model and print an autoplot() of the results.
knn_tune <- knn_workflow %>% 
  tune_grid(resamples = newrecords_folds, 
            grid = knn_grid,
            metrics = metric_set(roc_auc))

autoplot(knn_tune)

  • Use collect_metrics() and arrange() to find the roc_auc of the best-performing KNN model on the folds.
arrange(collect_metrics(knn_tune),desc(mean))
## # A tibble: 2 × 7
##   neighbors .metric .estimator  mean     n  std_err .config             
##       <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>               
## 1        15 roc_auc binary     0.501     5 0.000367 Preprocessor1_Model2
## 2         1 roc_auc binary     0.5       5 0        Preprocessor1_Model1
  • We find that the mean roc_auc of the best-performing KNN model on the folds is 0.501, with neighbors = 15.

  • Use select_best() to choose the model that has the optimal roc_auc. Then use finalize_workflow(), fit(), and augment() to fit the model to the training set and evaluate its performance on the testing set.

best_knn_complexity<- select_best(knn_tune, metric= 'roc_auc')


knn_final <- finalize_workflow( knn_workflow, best_knn_complexity)

knn_final_fit  <- fit(knn_final, data = newrecords_train)

# evaluate its performance on the testing set (estimate the accuracy)
augment(knn_final_fit , new_data = newrecords_test) %>%
  accuracy(truth = heart_disease, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.500
  • We want to create and visualize a confusion matrix heat map.
# create and visualize a confusion matrix heat map.
augment(knn_final_fit, new_data = newrecords_test) %>%
  conf_mat(truth = heart_disease , estimate = .pred_class) %>%
  autoplot(type = "heatmap")

  • Print the ROC curves
# Plot roc_curve
augment(knn_final_fit, new_data = newrecords_test) %>%
  roc_curve(heart_disease, .pred_Yes) %>%
  autoplot()

  • Calculate the AUC value of your best-performing model on the testing set
# Calculate AUC
augment(knn_final_fit , new_data = newrecords_test) %>%
  roc_auc(heart_disease, .pred_Yes)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.501

Final model

  • In this project, we have tried four different model classes. Now let’s compare the AUC and accuracy values of these four models on the testing set and select our final model. (Recall: logistic regression was the best among the logistic regression, LDA and QDA models; we don’t show LDA and QDA here since all three come from the same model class.)
# create a matrix that contains the accuracy and AUC 
# from Model Logistic regression, Random Forest, Boosted tree, KNN
Model_Name <-c('Logistic regression', 'Random Forest', 'Boosted tree', 'KNN')
Accuracy <- c(0.622,0.622,0.622,0.500)
AUC <- c( 0.646,0.645,0.643,0.501)
information <- cbind(Model_Name,Accuracy,AUC)
# convert it to a tibble
as_tibble(information) 
## # A tibble: 4 × 3
##   Model_Name          Accuracy AUC  
##   <chr>               <chr>    <chr>
## 1 Logistic regression 0.622    0.646
## 2 Random Forest       0.622    0.645
## 3 Boosted tree        0.622    0.643
## 4 KNN                 0.5      0.501
  • Based on the report above, we can see that logistic regression is the best of these 4 models, since it has both the highest accuracy and the highest AUC value.
# assign log_test to final_model
final_model <- log_test
  • Among these 17 key indicators, stroke appears to be the most important variable in our model. In other words, people who have had a stroke are more likely to have heart disease.
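One informal way to check which predictors matter most (a sketch, not a formal variable-importance analysis): because the recipe centers and scales all predictors, the magnitudes of the fitted logistic regression coefficients are roughly comparable, so we can sort them:

```r
library(broom)

# Sketch: pull the glm fit out of the workflow and rank coefficients
# by absolute size (larger magnitude suggests more influence)
final_model %>% 
  extract_fit_parsnip() %>% 
  tidy() %>% 
  arrange(desc(abs(estimate)))
```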

Let’s Check A Few Predictions

To show you exactly how our model works, let’s use 2 examples to predict whether a person will get heart disease based on the key indicators.

  • Set up example 1, and do a prediction
# set up example 1 
qualification1 <- data.frame(
  bmi = 24,
  smoking = 'No',
  alcohol_drinking = 'No',
  stroke = 'No',
  physical_health = 0,
  mental_health = 4,
  diff_walking = 'No',
  sex = 'Male',
  age_category = '18-24',
  race = 'Asian',
  diabetic = 'No',
  physical_activity = 'Yes',
  gen_health  ='Excellent',
  sleep_time = 9,
  asthma = 'No',
  kidney_disease = 'No',
  skin_cancer = 'No'
) %>% 
mutate(
    #heart_disease  = factor(heart_disease, levels = c('Yes', 'No')),
    smoking  = factor(smoking, levels = c('Yes', 'No')),
    alcohol_drinking  = factor(alcohol_drinking, levels = c('Yes', 'No')),
    stroke  = factor(stroke, levels = c('Yes', 'No')),
    diff_walking  = factor(diff_walking, levels = c('Yes', 'No')),
    sex  = factor(sex),
    age_category  = factor(age_category),
    race  = factor(race, levels = c("American Indian/Alaskan Native", "Asian", "Black", "Hispanic", "White", "Other")),
    diabetic  = factor(diabetic, levels = c("No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)")),
    physical_activity  = factor(physical_activity),
    gen_health  = factor(gen_health, levels = c("Excellent","Very good", "Good", "Fair", "Poor" )),
    asthma  = factor(asthma, levels = c('Yes', 'No')),
    kidney_disease  = factor(kidney_disease, levels = c('Yes', 'No')),
    skin_cancer  = factor(skin_cancer, levels = c('Yes', 'No')),
  )
# do a prediction
predict(final_model, qualification1)
## # A tibble: 1 × 1
##   .pred_class
##   <fct>      
## 1 No
  • Set up example 2, and do a prediction
# set up example 2
qualification2 <- data.frame(
  bmi = 40,
  smoking = 'Yes',
  alcohol_drinking = 'Yes',
  stroke = 'Yes',
  physical_health = 30,
  mental_health = 30,
  diff_walking = 'Yes',
  sex = 'Female',
  age_category = '75-79',
  race = 'Asian',
  diabetic = 'Yes',
  physical_activity = 'No',
  gen_health  ='Poor',
  sleep_time = 6,
  asthma = 'Yes',
  kidney_disease = 'Yes',
  skin_cancer = 'Yes'
) %>% 
mutate(
    smoking  = factor(smoking, levels = c('Yes', 'No')),
    alcohol_drinking  = factor(alcohol_drinking, levels = c('Yes', 'No')),
    stroke  = factor(stroke, levels = c('Yes', 'No')),
    diff_walking  = factor(diff_walking, levels = c('Yes', 'No')),
    sex  = factor(sex),
    age_category  = factor(age_category),
    race  = factor(race, levels = c("American Indian/Alaskan Native", "Asian", "Black", "Hispanic", "White", "Other")),
    diabetic  = factor(diabetic, levels = c("No", "No, borderline diabetes", "Yes", "Yes (during pregnancy)")),
    physical_activity  = factor(physical_activity),
    gen_health  = factor(gen_health, levels = c("Excellent","Very good", "Good", "Fair", "Poor" )),
    asthma  = factor(asthma, levels = c('Yes', 'No')),
    kidney_disease  = factor(kidney_disease, levels = c('Yes', 'No')),
    skin_cancer  = factor(skin_cancer, levels = c('Yes', 'No')),
  )
# do a prediction
predict(final_model, qualification2)
## # A tibble: 1 × 1
##   .pred_class
##   <fct>      
## 1 Yes

Conclusion

In order to generate a model that predicts whether people will get heart disease based on some key indicators (the 17 predictor variables in our dataset), we used an undersampling technique on CDC survey data from over 300k US residents from the year 2020. We tried four different model classes and tested a total of six models: logistic regression, LDA, QDA, random forest, boosted tree and KNN.

The following table shows the accuracy and AUC values of our fitted models.

# create a matrix that contains the accuracy and AUC 
# from Model Logistic regression, Random Forest, Boosted tree, KNN
Model_Name <-c('Logistic regression', 'Random Forest', 'Boosted tree', 'KNN')
Accuracy <- c(0.622,0.622,0.622,0.500)
AUC <- c( 0.646,0.645,0.643,0.501)
information <- cbind(Model_Name,Accuracy,AUC)
# convert it to a tibble
as_tibble(information) 
## # A tibble: 4 × 3
##   Model_Name          Accuracy AUC  
##   <chr>               <chr>    <chr>
## 1 Logistic regression 0.622    0.646
## 2 Random Forest       0.622    0.645
## 3 Boosted tree        0.622    0.643
## 4 KNN                 0.5      0.501

Finally, we found that the logistic regression model performed best, with the highest accuracy and AUC value, while KNN performed poorly, with the lowest accuracy and AUC value. Although logistic regression is the best of all these models, the accuracy of our final model is only 62.2%, which does not look very good.

I think one reason the accuracy is not very high is the undersampling technique we applied during data cleaning. Our original dataset is very imbalanced: 292422 people don’t have heart disease, and only 27373 people do. We used undersampling to randomly delete 265035 observations without heart disease, which is 90.6% of the original no-heart-disease records. Note that after undersampling the classes are roughly balanced, so a naive model that always predicts “No” would be only about 50% accurate; our 62.2% accuracy is therefore clearly better than chance, but discarding that much data inevitably throws away information. I think this helps explain why the accuracy of our model is only 62.2% and the AUC value is only 0.646.
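The class-balance arithmetic behind this discussion can be sketched directly (counts taken from the text above):

```r
no_hd  <- 292422   # observations without heart disease
yes_hd <- 27373    # observations with heart disease

# "Always No" baseline accuracy before undersampling
no_hd / (no_hd + yes_hd)        # ~0.914: high, but the model predicts nothing useful

# After randomly deleting 265035 no-heart-disease rows, classes are near-balanced
kept_no <- no_hd - 265035
kept_no / (kept_no + yes_hd)    # ~0.500: the naive baseline our 62.2% is measured against
```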

Overall, we generated a model that will predict whether people will get heart disease or not based on some key indicators in this project. By using this model, we can remind people to pay attention to health and lifestyle habits to help them avoid heart disease. Hope we can all stay away from heart disease and have a healthy and happy life.